In this dataset, we want to answer several questions: 1. Which carriers have the worst and the best punctuality in departure and arrival times? 2. Which carrier has done the best job of minimizing CarrierDelay? 3. What is the best time of day to fly to minimize delays? 4. What is the best time of year to fly to minimize delays? 5. How do patterns of flights to different destinations or parts of the country change over the course of the year?
We will address each question in turn.
ABIA <-read.csv("data/ABIA.csv", header=T, na.strings=c("","NA"))
attach(ABIA)
library(dplyr)
library(plotly)
p1 <- plot_ly(ABIA, y = ~ArrDelay, color = ~UniqueCarrier, type = "box")
p1
# drop the two extreme arrival-delay outliers
temp <- ABIA[which(ArrDelay != 649 & ArrDelay != 948), ]
p2 <- plot_ly(temp, y = ~ArrDelay, color = ~UniqueCarrier, type = "box")
p2
We can see that OH is the worst carrier when it comes to arrival punctuality.
# drop the two extreme departure-delay outliers
temp1 <- ABIA[which(DepDelay != 665 & DepDelay != 875), ]
p3 <- plot_ly(temp1, y = ~DepDelay, color = ~UniqueCarrier, type = "box")
p3
We can see that OH and EV are the worst carriers when it comes to departure punctuality, while US did a much better job on departures.
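The boxplot reading can be cross-checked numerically. Below is a minimal sketch (assuming ABIA is loaded as above) that computes the median arrival and departure delay per carrier with base R's aggregate(); medians are robust to the extreme outliers removed above.

```r
# median delays per carrier, sorted from most to least punctual
med_arr <- aggregate(ArrDelay ~ UniqueCarrier, data = ABIA, FUN = median)
med_dep <- aggregate(DepDelay ~ UniqueCarrier, data = ABIA, FUN = median)
med_arr[order(med_arr$ArrDelay), ]
med_dep[order(med_dep$DepDelay), ]
```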
# keep only the rows with a recorded delay breakdown
temp2 <- ABIA[!is.na(CarrierDelay), ]
attach(temp2)
# we can see that ArrDelay consists of five parts:
sum(temp2[,'ArrDelay']!=temp2[,'WeatherDelay']+temp2[,'NASDelay']+temp2[,'SecurityDelay']+ temp2[,'LateAircraftDelay']+temp2[,'CarrierDelay'])
## [1] 0
# drop WeatherDelay, which carriers cannot control
temp2[,'ArrDelay_control']=temp2[,'NASDelay']+temp2[,'SecurityDelay']+
temp2[,'LateAircraftDelay']+temp2[,'CarrierDelay']
p4 <- plot_ly(temp2[which(CarrierDelay != 875), ], y = ~CarrierDelay,
              color = ~UniqueCarrier, type = "box")
p4
We can see that airline MQ did the best job in terms of controlling its arrival time, and airline YV did the worst. But these delay breakdowns cover only part of the full flight list, so an airline might simply have recorded few of them. Let's look at the fraction of records for each airline.
temp2_freq=data.frame(table(temp2$UniqueCarrier))
ABIA_freq=data.frame(table(ABIA$UniqueCarrier))
record_freq <- cbind(temp2_freq[1],temp2_freq[-1]/ABIA_freq[-1])
record_freq
## Var1 Freq
## 1 9E 0.1647705
## 2 AA 0.2099025
## 3 B6 0.2280117
## 4 CO 0.2087757
## 5 DL 0.2708529
## 6 EV 0.2654545
## 7 F9 0.1646341
## 8 MQ 0.1794968
## 9 NW 0.2396694
## 10 OH 0.3382451
## 11 OO 0.1967621
## 12 UA 0.2304394
## 13 US 0.1207133
## 14 WN 0.1731563
## 15 XE 0.1933738
## 16 YV 0.2234682
p <- plot_ly(
x = record_freq$Var1,
y = record_freq$Freq,
name = "record freq",
type = "bar")
p
We can see that both MQ and YV report a reasonable share of flights with the five delay components recorded, so the comparison above is not an artifact of sparse records. We can therefore say that airline MQ did a good job of controlling its CarrierDelay, while YV did the worst; for airline YV, it is important to improve its own control of arrival time.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
# convert HHMM departure times into an hour-of-day column
ABIA$DepTime <- format(strptime(sprintf("%04d", ABIA$DepTime), format = "%H%M"), format = "%H:%M")
ABIA$DepTime<-hm(ABIA$DepTime)
ABIA$DepTime_hour<-hour(ABIA$DepTime)
p5 <- plot_ly(ABIA, y = ~DepDelay, color = ~as.character(DepTime_hour), type = "box")
p5
# convert HHMM arrival times into an hour-of-day column
ABIA$ArrTime <- format(strptime(sprintf("%04d", ABIA$ArrTime), format = "%H%M"), format = "%H:%M")
ABIA$ArrTime<-hm(ABIA$ArrTime)
ABIA$ArrTime_hour<-hour(ABIA$ArrTime)
p6 <- plot_ly(ABIA, y = ~ArrDelay, color = ~as.character(ArrTime_hour), type = "box")
p6
We can see from the plots that 5:00 am is the best time of day to fly to minimize delays; more broadly, the early-morning window from 5:00 am to 7:00 am is a good period.
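To back the boxplot reading with numbers, a hedged sketch (assuming the DepTime_hour column created above) of the median departure delay per scheduled hour:

```r
# median departure delay by hour of day; early-morning hours should come out lowest
hourly <- aggregate(DepDelay ~ DepTime_hour, data = ABIA, FUN = median)
hourly[order(hourly$DepDelay), ]
```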
p7 <- plot_ly(ABIA, y = ~DepDelay, color = ~as.character(Month), type = "box")
p7
p8 <- plot_ly(ABIA, y = ~ArrDelay, color = ~as.character(Month), type = "box")
p8
From the plots we can see that September is the best month of the year to fly to minimize delays.
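The same check by month, as a sketch on the ABIA data:

```r
# median departure and arrival delay by month; September should come out lowest
monthly <- aggregate(cbind(DepDelay, ArrDelay) ~ Month, data = ABIA, FUN = median)
monthly
```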
# note: loading plyr after dplyr masks summarise(), hence the dplyr:: prefixes below
library(plyr)
group <- ABIA %>%
  dplyr::group_by(Month, Dest) %>%
  dplyr::summarise(length(Dest))
par(mfrow = c(2, 2))
for (i in 1:length(unique(group$Dest))) {
  temp3 <- subset(group, Dest == unique(group$Dest)[i])
  plot(range(1:12), range(temp3$`length(Dest)`), type = "n", xlab = "Month",
       ylab = paste0("Number of Flights to ", unique(group$Dest)[i]))
  lines(temp3$Month, temp3$`length(Dest)`, type = "b", lwd = 1.5)
  line <- readline()  # press Enter to advance to the next destination
}
There is clearly a drop in the frequency of flights to some destinations starting around September; these include ABQ, AUS, CLE, CVG, IAD, LAS, MEM, MSY, ONT, RDU, SLC, SNA, TUS, and SEA. For some destinations, including BNA, DAL, MAF, and MCI, the frequency drops as early as June.
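The visual inspection above can also be made programmatic. A sketch that flags destinations whose average monthly flight count from September onward falls well below the January-August average (the 0.7 cutoff is an arbitrary illustration, not part of the original analysis):

```r
# ratio of post-August to pre-September average monthly flight counts per destination
drop_check <- group %>%
  dplyr::group_by(Dest) %>%
  dplyr::summarise(drop_ratio = mean(`length(Dest)`[Month >= 9]) /
                                mean(`length(Dest)`[Month <= 8]))
subset(drop_check, drop_ratio < 0.7)
```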
library(compiler)
enableJIT(3)
## [1] 3
rm(list=ls())
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 679899 36.4 1168576 62.5 1168576 62.5
## Vcells 3331504 25.5 10405454 79.4 13004239 99.3
Import the arules library for association rule mining.
library(arules)
Import the data with the read.transactions() function from arules, which automatically converts each row into a list of items separated by commas and returns a transactions object. It also drops duplicate items from each basket when rm.duplicates = TRUE.
groceries <- read.transactions("data/groceries.txt", sep=",", rm.duplicates = TRUE)
We can assign each user an id by converting the transactions to a list with as(groceries, "list"), defining names() on the list, and then converting the list back to transactions.
groceries_list <- as(groceries, "list")
names(groceries_list) <- as.character(1:length(groceries_list))
groceries <- as(groceries_list, "transactions")
At this point, the groceries object is ready for apriori analysis. Before applying the apriori() function, we need to choose the support and confidence thresholds and the maxlen value. This is essentially a heuristic process, so let's first try a relatively high support level of 0.01 and a confidence threshold of 0.55 (just a little more than 0.5) and see what we get.
params <- list(support=.01, confidence=.55, maxlen=4)
grocery_rules <- apriori(groceries, parameter = params)
> Apriori
>
> Parameter specification:
> confidence minval smax arem aval originalSupport support minlen maxlen
> 0.55 0.1 1 none FALSE TRUE 0.01 1 4
> target ext
> rules FALSE
>
> Algorithmic control:
> filter tree heap memopt load sort verbose
> 0.1 TRUE TRUE FALSE TRUE 2 TRUE
>
> Absolute minimum support count: 98
>
> set item appearances ...[0 item(s)] done [0.00s].
> set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
> sorting and recoding items ... [88 item(s)] done [0.00s].
> creating transaction tree ... done [0.00s].
> checking subsets of size 1 2 3 4 done [0.00s].
> writing ... [7 rule(s)] done [0.00s].
> creating S4 object ... done [0.00s].
inspect(subset(grocery_rules, subset=lift>=2))
> lhs rhs support confidence lift
> 1 {curd,
> yogurt} => {whole milk} 0.01006609 0.5823529 2.279125
> 2 {butter,
> other vegetables} => {whole milk} 0.01148958 0.5736041 2.244885
> 3 {domestic eggs,
> other vegetables} => {whole milk} 0.01230300 0.5525114 2.162336
> 4 {citrus fruit,
> root vegetables} => {other vegetables} 0.01037112 0.5862069 3.029608
> 5 {root vegetables,
> tropical fruit} => {other vegetables} 0.01230300 0.5845411 3.020999
> 6 {root vegetables,
> tropical fruit} => {whole milk} 0.01199797 0.5700483 2.230969
> 7 {root vegetables,
> yogurt} => {whole milk} 0.01453991 0.5629921 2.203354
Here we selected only the associations with lift greater than 2, which gives 7 in total. Among them are items such as other vegetables and whole milk, which are themselves frequent across all baskets. Such results provide limited information, so we need to look more closely at more interesting, less ubiquitous items.
So a natural question to ask is: which are the most frequent items? Let's make a plot of the top 10.
itemFrequencyPlot(groceries, topN=10)
Here we can see clearly that whole milk and other vegetables are indeed frequent items carrying relatively little information.
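The same information can be read numerically via arules' itemFrequency(), which returns each item's support:

```r
# support of the ten most frequent items across all baskets
head(sort(itemFrequency(groceries), decreasing = TRUE), 10)
```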
Therefore, let's lower the support level to 0.001 to include less frequent items. Also, when printing the associations, we raise the lift threshold to 10, which indicates highly correlated, dependent items.
params <- list(support=.001, confidence=.55, maxlen=4)
grocery_rules <- apriori(groceries, parameter = params)
> Apriori
>
> Parameter specification:
> confidence minval smax arem aval originalSupport support minlen maxlen
> 0.55 0.1 1 none FALSE TRUE 0.001 1 4
> target ext
> rules FALSE
>
> Algorithmic control:
> filter tree heap memopt load sort verbose
> 0.1 TRUE TRUE FALSE TRUE 2 TRUE
>
> Absolute minimum support count: 9
>
> set item appearances ...[0 item(s)] done [0.00s].
> set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
> sorting and recoding items ... [157 item(s)] done [0.00s].
> creating transaction tree ... done [0.00s].
> checking subsets of size 1 2 3 4 done [0.01s].
> writing ... [3314 rule(s)] done [0.00s].
> creating S4 object ... done [0.00s].
inspect(subset(grocery_rules, subset=lift>=10))
> lhs rhs support confidence lift
> 1 {liquor,
> red/blush wine} => {bottled beer} 0.001931876 0.9047619 11.23527
> 2 {popcorn,
> soda} => {salty snack} 0.001220132 0.6315789 16.69779
> 3 {Instant food products,
> soda} => {hamburger meat} 0.001220132 0.6315789 18.99565
> 4 {ham,
> processed cheese} => {white bread} 0.001931876 0.6333333 15.04549
> 5 {baking powder,
> flour} => {sugar} 0.001016777 0.5555556 16.40807
> 6 {hard cheese,
> whipped/sour cream,
> yogurt} => {butter} 0.001016777 0.5882353 10.61522
> 7 {hamburger meat,
> whipped/sour cream,
> yogurt} => {butter} 0.001016777 0.6250000 11.27867
Here we see some intriguing associations that are less frequent among all baskets but exhibit strong dependence in terms of lift. While in the previous case there were only 7 associations even with a lift threshold of 2, here there are 7 associations with lift higher than 10.
Look at the association {liquor, red/blush wine} => {bottled beer}: intuitively, this is a combination appealing to an alcohol lover. The intuition also holds for other associations, such as {baking powder, flour} => {sugar}, which is probably part of a common baking recipe.
And what about other associations?
inspect(subset(grocery_rules, subset=(lift<=10 & lift>8)))
> lhs rhs support confidence lift
> 1 {frozen vegetables,
> specialty chocolate} => {fruit/vegetable juice} 0.001016777 0.6250000 8.645394
> 2 {frozen fish,
> other vegetables,
> tropical fruit} => {pip fruit} 0.001016777 0.6666667 8.812724
> 3 {flour,
> root vegetables,
> whole milk} => {whipped/sour cream} 0.001728521 0.5862069 8.177794
> 4 {misc. beverages,
> other vegetables,
> tropical fruit} => {fruit/vegetable juice} 0.001016777 0.5882353 8.136841
> 5 {citrus fruit,
> fruit/vegetable juice,
> grapes} => {tropical fruit} 0.001118454 0.8461538 8.063879
> 6 {fruit/vegetable juice,
> grapes,
> tropical fruit} => {citrus fruit} 0.001118454 0.6875000 8.306588
> 7 {citrus fruit,
> grapes,
> tropical fruit} => {fruit/vegetable juice} 0.001118454 0.6111111 8.453274
> 8 {butter,
> hard cheese,
> yogurt} => {whipped/sour cream} 0.001016777 0.6250000 8.718972
> 9 {butter,
> hard cheese,
> other vegetables} => {whipped/sour cream} 0.001220132 0.6000000 8.370213
> 10 {butter,
> hard cheese,
> whole milk} => {whipped/sour cream} 0.001423488 0.6666667 9.300236
> 11 {ham,
> other vegetables,
> tropical fruit} => {pip fruit} 0.001626843 0.6153846 8.134822
> 12 {butter,
> sliced cheese,
> whole milk} => {whipped/sour cream} 0.001220132 0.6000000 8.370213
> 13 {cream cheese,
> sugar,
> whole milk} => {domestic eggs} 0.001118454 0.5500000 8.668670
> 14 {curd,
> sugar,
> yogurt} => {whipped/sour cream} 0.001016777 0.6250000 8.718972
> 15 {butter,
> other vegetables,
> sugar} => {whipped/sour cream} 0.001016777 0.7142857 9.964539
> 16 {citrus fruit,
> cream cheese,
> whole milk} => {domestic eggs} 0.001626843 0.5714286 9.006410
> 17 {domestic eggs,
> frankfurter,
> tropical fruit} => {pip fruit} 0.001016777 0.6250000 8.261929
> 18 {shopping bags,
> tropical fruit,
> whipped/sour cream} => {pip fruit} 0.001118454 0.6470588 8.553526
Above are the associations with lift between 8 and 10. Here we can see some interesting combinations such as {butter, hard cheese, whole milk} => {whipped/sour cream}. Why would a customer buy such protein- and fat-heavy foods together? Probably it is simply because of how these products are placed in the store: products shelved together are more likely to be sold as a bundle. The same can be seen in {citrus fruit, grapes, tropical fruit} => {fruit/vegetable juice}.
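Finally, rather than choosing lift windows by hand, the whole rule set can be ranked directly. A sketch using the sort() method that arules defines for rules, applied to the existing grocery_rules object:

```r
# the five rules with the highest lift across the full rule set
inspect(head(sort(grocery_rules, by = "lift"), 5))
```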